Lecture 6: Column transformer and text features

Firas Moosvi (Slides adapted from Varada Kolhatkar)

Announcements

  • Lecture recordings for the first two weeks have been made available - See Piazza.
  • My Office Hours
  • HW3 is due next week Tuesday, Oct 1st, 11:59 pm.
    • You can work in pairs for this assignment.

Quick Correction on Exercise 5.3

I accidentally said only D is true, but B is also true!

Select all of the following statements which are TRUE.

    1. You can have scaling of numeric features, one-hot encoding of categorical features, and scikit-learn estimator within a single pipeline.
    1. Once you have a scikit-learn pipeline object with an estimator as the last step, you can call fit, predict, and score on it.
    1. You can carry out data splitting within scikit-learn pipeline.
    1. We have to be careful of the order we put each transformation and model in a pipeline.

Recap: Preprocessing mistakes

Data

X, y = make_blobs(n_samples=100, centers=3, random_state=12, cluster_std=5) # make synthetic data
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X, y, random_state=5, test_size=0.4) # split it into training and test sets
# Visualize the training data
plt.scatter(X_train_toy[:, 0], X_train_toy[:, 1], label="Training set", s=60)
plt.scatter(
    X_test_toy[:, 0], X_test_toy[:, 1], color=mglearn.cm2(1), label="Test set", s=60
)
plt.legend(loc="upper right")

❌ Bad ML 1

  • What’s wrong with the approach below?
scaler = StandardScaler() # Creating a scalert object 
scaler.fit(X_train_toy) # Calling fit on the training data 
train_scaled = scaler.transform(
    X_train_toy
)  # Transforming the training data using the scaler fit on training data

scaler = StandardScaler()  # Creating a separate object for scaling test data
scaler.fit(X_test_toy)  # Calling fit on the test data
test_scaled = scaler.transform(
    X_test_toy
)  # Transforming the test data using the scaler fit on test data

knn = KNeighborsClassifier()
knn.fit(train_scaled, y_train_toy)
print(f"Training score: {knn.score(train_scaled, y_train_toy):.2f}")
print(f"Test score: {knn.score(test_scaled, y_test_toy):.2f}") # misleading scores
Training score: 0.63
Test score: 0.60

Scaling train and test data separately

❌ Bad ML 2

  • What’s wrong with the approach below?
# join the train and test sets back together
XX = np.vstack((X_train_toy, X_test_toy))

scaler = StandardScaler()
scaler.fit(XX)
XX_scaled = scaler.transform(XX)

XX_train = XX_scaled[:X_train_toy.shape[0]]
XX_test = XX_scaled[X_train_toy.shape[0]:]

knn = KNeighborsClassifier()
knn.fit(XX_train, y_train_toy)
print(f"Training score: {knn.score(XX_train, y_train_toy):.2f}")  # Misleading score
print(f"Test score: {knn.score(XX_test, y_test_toy):.2f}")  # Misleading score
Training score: 0.63
Test score: 0.55

❌ Bad ML 3

  • What’s wrong with the approach below?
knn = KNeighborsClassifier()

scaler = StandardScaler()
scaler.fit(X_train_toy)
X_train_scaled = scaler.transform(X_train_toy)
X_test_scaled = scaler.transform(X_test_toy)
cross_val_score(knn, X_train_scaled, y_train_toy)
array([0.25      , 0.5       , 0.58333333, 0.58333333, 0.41666667])

Improper preprocessing

Proper preprocessing

Recap: sklearn Pipelines

  • Pipeline is a way to chain multiple steps (e.g., preprocessing + model fitting) into a single workflow.
  • Simplify the code and improves readability.
  • Reduce the risk of data leakage by ensuring proper transformation of the training and test sets.
  • Automatically apply transformations in sequence.
  • Example:
    • Chaining a StandardScaler with a KNeighborsClassifier model.
from sklearn.pipeline import make_pipeline

pipe_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Correct way to do cross validation without breaking the golden rule. 
cross_val_score(pipe_knn, X_train_toy, y_train_toy) 
array([0.25      , 0.5       , 0.5       , 0.58333333, 0.41666667])

Group Work: Class Demo & Live Coding

For this demo, each student should click this link to create a new repo in their accounts, then clone that repo locally to follow along with the demo from today.




If you really don’t want to create a repo,

  • Navigate to the cpsc330-2024W1 repo
  • run git pull to pull the latest files in the course repo
  • Look for the demo file here: lectures/102-Firas-lectures/class_demos/.

sklearn’s ColumnTransformer

  • Use ColumnTransformer to build all our transformations together into one object
  • Use a column transformer with sklearn pipelines.

(iClicker) Exercise 6.1

iClicker cloud join link: https://join.iclicker.com/VYFJ

Select all of the following statements which are TRUE.

    1. You could carry out cross-validation by passing a ColumnTransformer object to cross_validate.
    1. After applying column transformer, the order of the columns in the transformed data has to be the same as the order of the columns in the original data.
    1. After applying a column transformer, the transformed data is always going to be of different shape than the original data.
    1. When you call fit_transform on a ColumnTransformer object, you get a numpy ndarray.

More preprocessing

Remarks on preprocessing

  • There is no one-size-fits-all solution in data preprocessing, and decisions often involve a degree of subjectivity.
    • Exploratory data analysis and domain knowledge inform these decisions
  • Always consider the specific goals of your project when deciding how to encode features.

Alternative methods for scaling

  • StandardScaler
    • Good choice when the column follows a normal distribution or a distribution somewhat like a normal distribution.
  • MinMaxScaler: Transform each feature to a desired range. Appropriate when
    • Good choice for features such as human age, where there is a fixed range of values and the feature is uniformly distributed across the range
  • Normalizer: Works on rows rather than columns. Normalize examples individually to unit norm.
    • Good choice for frequency-type data
  • Log scaling
    • Good choice for features such as ratings per movies (power law distribution; a few movies have lots of ratings but most movies have very few ratings)

Ordinal encoding vs. One-hot encoding

  • Ordinal Encoding: Encodes categorical features as an integer array.
  • One-hot Encoding: Creates binary columns for each category’s presence.
  • Sometimes how we encode a specific feature depends upon the context.

Ordinal encoding vs. One-hot encoding

  • Consider weather feature and its four categories: Sunny (☀️), Cloudy (🌥️), Rainy (⛈️), Snowy (❄️)
  • Which encoding would you use in each of the following scenarios?
    • Predicting traffic volume
    • Predicting severity of weather-related road incidents

Ordinal encoding vs. One-hot encoding

  • Consider weather feature and its four categories: Sunny (☀️), Cloudy (🌥️), Rainy (⛈️), Snowy (❄️)
  • Predicting traffic volume: Using one-hot encoding would make sense here because the impact of different weather conditions on traffic volume does not necessarily follow a clear order and different weather conditions could have very distinct effects.
  • Predicting severity of weather-related road incidents: An ordinal encoding might be more appropriate if you define your weather categories from least to most severe as this could correlate directly with the likelihood or severity of incidents.

handle_unknown = "ignore" of OneHotEncoder

  • Use handle_unknown='ignore' with OneHotEncoder to safely ignore unseen categories during transform.
  • In each of the following scenarios, identify whether it’s a reasonable strategy or not.
    • Example 1: Suppose you are building a model to predict customer behavior (e.g., purchase likelihood) based on features like location, device_type, and product_category. During training, you have observed a set of categories for product_category, but in the future, new product categories might be added.
    • Example 2: You’re building a model to predict disease diagnosis based on symptoms, where each symptom is categorized (e.g., fever, headache, nausea).

handle_unknown = "ignore" of OneHotEncoder

  • Reasonable use: When unseen categories are less likely to impact the model’s prediction accuracy (e.g., product categories in e-commerce), and you prefer to avoid breaking the model.
  • Not-so-reasonable use: When unseen categories could provide critical new information that could significantly alter predictions (e.g., in medical diagnostics), ignoring them could result in a poor or dangerous outcome.

drop="if_binary" argument of OneHotEncoder

  • drop=‘if_binary’ argument in OneHotEncoder:
  • Reduces redundancy by dropping one of the columns if the feature is binary.

Categorical variables with too many categories

  • Strategies for categorical variables with too many categories:
    • Dimensionality reduction techniques
    • Bucketing categories into ‘others’
    • Clustering or grouping categories manually
    • Only considering top-N categories

Dealing with text features

  • Preprocessing text to fit into machine learning models using text vectorization.
  • Bag of words representation

sklearn CountVectorizer

  • Use scikit-learn’s CountVectorizer to encode text data
  • CountVectorizer: Transforms text into a matrix of token counts
  • Important parameters:
    • max_features: Control the number of features used in the model
    • max_df, min_df: Control document frequency thresholds
    • ngram_range: Defines the range of n-grams to be extracted
    • stop_words: Enables the removal of common words that are typically uninformative in most applications, such as “and”, “the”, etc.

Incorporating text features in a machine learning pipeline

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

text_pipeline = make_pipeline(
    CountVectorizer(),
    SVC()
)